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Abstract 

Principal components analysis (PCA) is a standard tool for identifying good low- 
dimensional approximations to data sets in high dimension. Many current data sets of 
interest contain private or sensitive information about individuals. Algorithms which 
operate on such data should be sensitive to the privacy risks in publishing their out- 
puts. Differential privacy is a framework for developing tradeoffs between privacy and 
the utility of these outputs. In this paper we investigate the theory and empirical per- 
formance of differentially private approximations to PCA and propose a new method 
which explicitly optimizes the utility of the output. We demonstrate that on real data, 
there this a large performance gap between the existing methods and our method. We 
show that the sample complexity for the two procedures differs in the scaling with the 
data dimension, and that our method is nearly optimal in terms of this scaling. 

1 Introduction 

Dimensionality reduction is a fundamental tool for understanding complex data sets that 
arise in contemporary machine learning and data mining applications. Even though a single 
data point can be represented by hundreds or even thousands of features, the phenomena 
of interest are often intrinsically low-dimensional. By reducing the "extrinsic" dimension of 
the data to its "intrinsic" dimension, analysts can discover important structural relation- 
ships between features, more efficiently use the transformed data for learning tasks such 
as classification or regression, and greatly reduce the space required to effectively store 
the data. One of the oldest and most classical methods for dimensionality reduction is 
principal components analysis (PCA), which computes a low-rank approximation to the 
(positive semidefmite) second moment matrix of a set of points in M. d . The rank k of the 
approximation is chosen to be the intrinsic dimension of the data. We view this procedure 
as specifying a A;- dimensional subspace of R d . 

Much of today's machine-learning is performed on the vast amounts of personal infor- 
mation collected by private companies and government agencies about individuals, such as 
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customers, users, and subjects. These datasets contain sensitive information about individ- 
uals and typically involve a large number of features. It is therefore important to design 
machine-learning algorithms which discover important structural relationships in the data 
while taking into account its sensitive nature. We study approximations to PCA which guar- 



antee differential privacy, a cryptographically motivated definition of privacy |Dwork et al. 



and data-mining 


; communities [Machanavajjhala et al. 


2008, 


McSherry and Mironov 


2009 


McSherry 


2009, 


Friedman and Schuster, 2010, Mohammed et al. , 2011 . Differential privacy 



measures privacy risk by a parameter a that bounds the log-likelihood ratio of output of a 
(private) algorithm under two databases differing in a single individual. 

There are many general tools for providing differential privacy. The sensitivity method [rework 



et al. , 2006b computes the desired algorithm (PCA) on the data and then adds noise pro- 
portional to the maximum change than can be induced by changing a single point in the 
data set. The PCA algorithm is very sensitive in this sense because the top eigenvector 
can change by 90° by changing one point in the data set. Relaxations such as smoothed 
sensitivity 



Nissim et al 



method of Blum et al. 2005 



2007 are difficult to compute in this setting as well. The SULQ 
adds noise to the second moment matrix and then runs PCA 
on the noisy matrix. As our experiments show, the amount of noise required is often quite 
severe and SULQ seems impractical for data sets of moderate size. 

These general methods do not take into account the quality of approximation to the 
non-private PCA output. We address this by proposing a new method, PPCA, that is an 
instance of the exponential mechanism of McSherry and Talwar 20071 . This differentially 



private method outputs a /c-dimensional subspace; the output is biased towards subspaces 
which are close to the output of PCA, and in our case corresponds to sampling from the 
matrix Bingham distribution. We implement this method using a Markov Chain Monte 



Carlo (MCMC) procedure due to Hoff 2009 and show that it achieves significantly better 
empirical performance. 

In order to understand this gap in performance we prove sample complexity bounds 
for SULQ and PPCA, as well as a general lower bound on the sample complexity for any 
differentially private algorithm. We show that the sample complexity scales as Sl(d?l 2 ^fd) 
for SULQ and as 0(d) for PPCA. Furthermore, any differentially private algorithm requires 
Q(d) samples, showing that PPCA is nearly optimal in terms of sample complexity. These 
theoretical results suggest that our experiments exhibit the limit of how well a-differentially 
private algorithms can perform. 

There are several interesting open questions suggested by this work. One set of issues 
is computational. Differentially privacy is a mathematical definition, but algorithms must 
be implemented using finite precision machines. Privacy and computation interact in many 
places, including pseudorandomness, numerical stability, optimization, and in the MCMC 
procedure we use to implement PPCA. A second set of issues is theoretical - our theoretical 
analysis for the sample complexity is for the special case of k = 1 in which the distance and 
angles between vectors are related. For general k the measure of approximation may be 
application dependent; distances between subspaces and frames may be different and lead 
to different tradeoffs. 
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Related work 



There are many different models for privacy-preserving computation Agrawal and Srikant 



2000 


, Evfimievski et al. 


- see 


Fung et al. 


|2010] i 



2006 



Li et al. 



2010 



shown to be susceptible to composition attacks, which are attacks in which the adversary 

An alternative line of 



exploits prior knowledge to reidentify individuals Ganta et al. 



Computation setting due to |Yao 1982 



decomposition in this model |Han et al. 



2008 



privacy-preserving data- mining work Zhan and Matwin, 2007 is in the Secure Multiparty 

; one work studies privacy-preserving singular value 
Finally, dimension reduction through random 



2009 



projection has been considered as a technique for sanitizing data prior to publication |Liu 



et al. , 2006 ; our work differs from this line of work in that we offer differential privacy 
guarantees, and we only release the PCA subspace, not actual data. 

and has been used since in 



Differential privacy was proposed by Dwork et al. 2006b 



many works on privacy [Kasiviswanathan et al 



2008 



Chaudhuri and Mishra, 2006, Barak 



et al.[ |2007"1 |Machanavajjhala etakj |2008[ [McSherry and Mironov[ |2009[ [Chaudhuri et al. 



20TTj |Nissim et aT| [20071 |Blum et al.] [20081 |McSherry and Talwar[ [20071 |Friedman and 



Schuster! |2010[ |Hay et al.[ |2009[ |Williams and McSherry] |2010l |Roy et al. 



ferential privacy has been shown to have strong semantic guarantees Dwork et al 



2010 



Dif- 



2006b 



Kasiviswanathan and Smith, 2008 and is resistant to many attacks |Ganta et al. 2008 that 



succeed against some of the other aforementioned definitions of privacy. There are several 
standard approaches for designing differentially-private data-mining algorithms, including 
input perturbation Blum et al. 2005 , output perturbation |Dwork et al. 2006b , the ex- 



et al., 2011 



ponential mechanism McSherry and Talwar, 2007 , and objective perturbation Chaudhuri 



The SULQ method Blum et al., 2005 



gave a general differentially-private input per- 
turbation algorithm and showed that it could be applied to PCA. |Williams and McSherry| 
[2010 described a method for inferring one-dimensional subspace from noisy measurements 

these by tuning the 



2005 



using differential privacy. Our work differs from Blum et al. 
mechanism to a utility function for PCA and from Williams and McSherry |2010] by not 
making modeling assumptions on the data. Independently of our work, Hardt and Roth 
2012 1 consider the problem of differentially-private low-rank matrix reconstruction with a 
view towards applications to sparse matrices; provided certain coherence conditions hold, 
they provide an algorithm for constructing a rank 2k approximation B to a matrix A such 
that \\A — B\\p is 0( ||^4 — plus some additional terms which depend on d, k and n; 
here is the best rank k approximation to A. Because of their additional assumptions, 
their bounds are generally incomparable to ours; however, it can be shown that our bounds 
are superior for dense matrices A. 

In many differentially-private algorithms that use input perturbation, the magnitude of 
the noise needed to guarantee differential privacy often grows linearly with the extrinsic di- 



mension of the data. Examples include differentially-private means and covariances Dwork 
et al. 2006b| and differentially-private PCA and k- means [Blum et al. 2005| . Another ex- 
ample is linear classification; Chaudhuri et al. [201 1 present linear classification algorithms 
in which the magnitude of the added perturbation, as well as the sample requirement, grows 
linearly with the extrinsic dimension of the dataset. To the best of our knowledge, the only 
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prior work which considered differentially-private PC A is SULQ [Blum et al. 2005| ; SULQ 
adds noise to each entry of the empirical second moment matrix, and then outputs the top 
k singular vectors of this perturbed matrix. Our experiments as well as theoretical results 
indicate that our solution has higher statistical efficiency, particularly when k is small com- 



pared to the extrinsic data dimension. However, it has been recently shown by Hall et al. 



[2012] that (a, <5)-differential privacy can provide a significantly better dependence on the 
dimension than a-differential privacy. 

Lower bounds on the sample requirement of differentially private algorithms for statis- 
tical tasks is also an active area of research. Hardt and Talwar 2010| prove lower bounds 
on the amount of noise that needs to be added to a number of histogram count queries, and 



Chaudhuri and Hsu 2011 show lower bounds on the sample requirement of differentially- 
private classification as a function of the doubling dimension of the disagreement metric of 
the hypothesis class with respect to the data distribution. Our lower bounds on the sample 
requirement of any differentially-private algorithm extends ideas from these two papers. 



2 Preliminaries 

For an integer d, let [d] = {1, 2, . . . , d}, let E> d ~ l be the unit sphere in M. d , and let Id be the 
d X d identity matrix. The transpose of a matrix A is denoted by A T . If A is positive definite 
we denote by \i(A) the i-th largest eigenvalue of A. When Xi(A) has multiplicity one, we 
denote by Vi(A) the corresponding eigenvector. We use Id to denote the d x d identity 
matrix. We will use different norms in this paper : ||A|| F = ti{AA T ) is the Frobenius norm 
of a matrix A and ||x|| 2 is the Euclidean norm. We will write KL = J f( x ){^0jdx for 

the Kullback-Leibler divergence between two densities / and g. 

The data given to our algorithm is a set of n vectors T> = {x\, X2, ■ ■ ■ , x n } where each 
Xi corresponds to the private value of one individual, X{ S M. d , and \\xi\\ < 1 for all i. Let 
X = f x\, . . . , x n ] be the matrix whose columns are the data vectors {xj}. Let A — ^-XX^ 
denote the d x d second moment matrix of the data. The matrix A is positive semidefinite, 
and has Frobenius norm at most 1. 

The problem of dimensionality reduction is to find a "good" low-rank approximation 
to A. A popular solution is to compute a rank-/c matrix A which minimizes the norm 
where k is much lower than the data dimension d. The Schmidt approximation 



A- A 



F 



theorem Stewart, 1993 Eckart and Young 1936 shows that the minimizer is given by the 



singular value decomposition, also known as the PCA algorithm in some areas of computer 
science. Let the eigenvalues of A be Xi(A) > A2(^4) > • • • > Xd(A) > and let A be a 
diagonal matrix with An = \i(A). The matrix A decomposes as 

A = VAV T , (1) 

where V is an orthonormal matrix of eigenvectors. 

Definition 1. Suppose A is a positive semidefinite matrix whose first k eigenvalues are 
distinct. The top-k subspace of A is the matrix 

V k (A) = [ Vl v 2 ■■■ v k }. (2) 
where = Xi(A) for i = 1, 2, . . . , k and A^ = elsewhere, and V is given by ([!]). 
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Given the top-k subspace and the eigenvalue matrix A, we can form an approximation 
to A: 

A^ = [V k (A)0]A[V k (A)0] T , (3) 

In the special case k = 1 we have A^ = Xi(A)viVi , where v\ is the eigenvector correspond- 
ing to \\{A). We refer to v\ as the top eigenvector of the data. 

We study approximations to A that preserve the privacy of the underlying data. The 
notion of privacy that we use is differential privacy, which quantifies the privacy guaranteed 
by a randomized algorithm V applied to a data set T>. 

Definition 2. An algorithm A(B) taking values in a set T provides a -differential privacy 
if 

H(S | B = V) ^ a ... 
s v,v> n\S \B = V) 

where the first supremum is over all measurable S C T, the second is over all data sets T> 
and T>' differing in a single entry, and n(-\B) is the conditional distribution (measure) on 
T induced by the output A(B) given a data set B. The ratio is interpreted to be 1 whenever 
the numerator and denominator are both 0. 

Definition 3. An algorithm A(B) taking values in a set T provides (a, 8) -differential pri- 
vacy if 

P (A(V) £ S) < e a F (A(V) £ S) + 5, (5) 
for all all measurable S C T and all data sets T> and T>' differing in a single entry. 

Here a and 5 are privacy parameters, where low a and 5 ensure more privacy. For 



Dwork et al. 2006a 



more details about these definitions, see Dwork et al. 2006b , Wasserman and Zhou |2010 



The purpose of PCA is often to approximate the data by a low dimensional subspace 
that captures most of the "energy" of the data. For a d x k matrix V with orthonormal 
columns, the quality of V in approximating A can be measured by 

q F (V) = tr (y T AV^j . (6) 

The V which maximizes q(V) has columns equal to {vi : i G [k]}, corresponding to the top 
k eigenvectors of A. 

Our theoretical results apply to the special case k = 1, although they can be extended 
for general k. For these results, we measure the inner product between the output vector 
vi and the true top eigenvector v\: 

Qa(vi) = \(vi,vi}\ . (7) 

This is related to ([6]). If we write v\ in the basis spanned by {i>j}, then 

d 

q F (vi) = Aig A ({)i) 2 + ^2 h(v\,Vi) 2 

1=2 

Our proof techniques use the geometric properties of (?a( - )- 
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Definition 4. A randomized algorithm A(-) is an (p,r/)-close approximation to the top 
eigenvector if for all data sets D of n points, 

F(q A (A(V)) >p) > l-rj, (8) 

where the probability is taken over A(-). 

In this paper we are interested in proving results on the sample complexity of differen- 
tially private algorithms that approximate PCA. That is, for a given a and p, how large 
must the number of individuals n in the data set be such that it is a-differentially private 
and also a (p, r/)-close approximation to PCA? It is well known that as the number of indi- 
viduals n grows, it is easier to guarantee the same level of privacy with relatively less noise 
or perturbation, and therefore the utility of the approximation also improves. Our results 
characterize how privacy and utility scale with n and the tradeoff between them for fixed 
n. 



3 Algorithms and main results 

In this section we describe differentially private techniques for approximating Q. The 



first is a modified version of the SULQ method Blum et al. 2005 . Our new algorithm 
for differentially-private PCA, PPCA, is an instantiation of the exponential mechanism due 
to McSherry and Talwar 20071 . Both procedures provide differentially private approxima- 



tions to the top-fc subspace: SULQ provides (a, <5)-differential privacy and PPCA provides 
a-differential privacy. 

3.1 Input perturbation 

The only differentially- private approximation to PCA prior to this work is the SULQ method 
|Blum et al. 2005 . The SULQ method perturbs each entry of the empirical second moment 



matrix A to ensure differential privacy and releases the top k eigenvectors of this perturbed 
matrix. In particular, SULQ recommends adding a matrix N of i.i.d. Gaussian noise of vari- 
ance 8d '"I a^*^ and applies the PCA algorithm to A + N. This guarantees slightly a weaker 
privacy definition known as (a, ^-differential privacy. One problem with this approach is 
that with probability 1 the matrix A+N is not symmetric, so the largest eigenvalue may not 
be real and the entries of the corresponding eigenvector may be complex. Thus the SULQ 
algorithm is not a good candidate for practical privacy-preserving dimensionality reduction. 

However, a simple modification to the basic SULQ approach does guarantee (a, 5) dif- 
ferential privacy. Instead of adding a asymmetric Gaussian matrix, the algorithm can add 
the a symmetric matrix with i.i.d. Gaussian entries N. That is, for 1 < i < j < d, the 
variable Nij is an independent Gaussian random variable with variance (3 2 . Note that this 
matrix is symmetric but not necessarily positive semidefinite, so some eigenvalues may be 
negative but the eigenvectors are all real. A derivation for the noise variance is given in 
Theorem [TJ 
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Algorithm 1: Algorithm MOD-SULQ (input pertubation) 



inputs: d x n data matrix X, privacy parameter a, parameter 5 
outputs: d x k matrix V]j. = [i>i i>2 ■ ■ ■ Vk] with orthonormal columns 
Set A = \XX T . 
Set 

"-a y V52V27r/ Van 

Generate a d x d symmetric random matrix TV whose entries are i.i.d. drawn from 
AA(0,/3 2 ).^ 

Compute Vfc = V k (A + N) according to Q. 



3.2 Exponential mechanism 



Our new method, PPCA, randomly samples a /c-dimensional subspace from a distribution 
that ensures differential privacy and is biased towards high utility. The distribution from 
which our released subspace is sampled is known in the statistics literature as the matrix 
Bingham distribution |Bingham 1974, Chikuse, 2003 , which we denote by BMFk(B). The 



algorithm is in terms of general k < d but our theoretical results focus on the special case 
k = 1 where we wish to release a one-dimensional approximation to the data covariance 
matrix. The matrix Bingham distribution takes values on the set of all fc-dimensional 
subspaces of M. d and has a density equal to 



f(V) 



1 



l F 1 (|fe,id,B) 



exp(tr(y T BV)), 



(10) 



where V is a dx k matrix whose columns are orthonormal and 1 F 1 (hk,^d, B) is a confluent 
hypergeometric function | Chikuse 2003 p. 33]. Numerical approximations of this constant 
can be calculated using a variety of techniques |Butler and Wood , 2002 Koev and Edelman 
2006 



Algorithm 2: Algorithm PPCA (exponential mechanism) 



inputs: d x n data matrix X, privacy parameter a, dimension k 
outputs: d x k matrix Vk = [vi V2 ■•• Vk] with orthonormal columns 
Set A = lXX T 



Sample V k = BMF (nf A) 



By combining results on the exponential mechanism iMcSherry and Talwar 2007 along 
with properties of PCA algorithm, we can show that this procedure is differentially private. 
In many cases, sampling from the distribution specified by the exponential mechanism dis- 
tribution may be difficult computationally, especially for continuous-valued outputs. We 
implement PPCA using a recently-proposed Gibbs sampler due to Hoff 2009| . Gibbs sam- 
pling is a popular Markov Chain Monte Carlo (MCMC) technique in which samples are 
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generated according to a Markov chain whose stationary distribution is the density in (10). 
Assessing the "burn-in time" and other factors for this procedure is an interesting question 
in its own right; further details are in Section 6.2 



3.3 Other approaches 

There are other general algorithmic strategies for guaranteeing differential privacy. The 



sensitivity method Dwork et al. 2006b| adds noise proportional to the maximum change 



that can be induced by changing a single point in the data set. Consider a data set T> with 
m+1 copies of a unit vector u and m copies of a unit vector u' with ulu' and let T>' have m 
copies of u and m + 1 copies of v! . Then v\{V) = u but v\(V') = u' , so — = 

Thus the global sensitivity does not scale with the number of data points, so as n 
increases the variance of the noise required by the Laplace mechanism [Dwork et al. 2006b 



will not decrease. An alternative to global sensitivity is smooth sensitivity Nissim et al. 



2007 ; except for special cases, such as the sample median, smooth sensitivity is difficult 



to compute for general functions. A third method for computing private, approximate 



solutions to high-dimensional optimization problems is objective perturbation |Chaudhuri 



et al. 2011 ; to apply this method, we require the optimization problems to have certain 



properties (namely, strong convexity and bounded norms of gradients), which do not apply 
to PCA. 



3.4 Main results 

Our theoretical results are sample complexity bounds for PPCA and MOD-SULQ as well as 
a general lower bound on the sample complexity for any a-differentially private algorithm. 
These results show that the PPCA is nearly optimal in terms the scaling of the sample 
complexity with respect to the data dimension d, privacy parameter a, and correlation p. 
We further show that MOD-SULQ requires more samples as a function of d, despite having 
a slightly weaker privacy guarantee. 

Even though both algorithms can output the top-A; PCA subspace for general k < d, 
we prove results for the case k = 1. Proving analogous results for general k would require 
significantly more analysis and calculations, whereas for k = 1 the scaling behavior is cleaner 
to elucidate. Finding the scaling behavior of the sample complexity with k is an interesting 
open problem that we leave for future work. 

The fact that these two algorithms are differentially private follows from some simple 
calculations. 

Theorem 1. For the (3 in the MOD-SULQ algorith m is (a, 5) differentially private. 

Theorem 2. Algorithm PPCA is a-differentially private. 

Our first sample complexity result provides an upper bound on the number of samples 
required by PPCA to guarantee a certain level of privacy and accuracy. The sample com- 
plexity of PPCA n grows linearly with the dimension d, inversely with a, and inversely with 
the correlation gap (1 — p) and eigenvalue gap \\(A) — \2{A). 
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Theorem 3 (Sample complexity of PPCA). // 

d /logfl/n) 4Ai \ 

a{l - p){Xi - A 2 ) V d (1 — yo^)(Ai - X 2 ) J 

then the top PCA direction v\ and the output of PPCA v\ with privacy parameter a satisfy: 

Pr(|(«i,*i)| > p) > I-77. 

That is, PPCA is a (p,n)-close approximation to PCA. 

Our second result shows a lower bound on the number of samples required by any a- 
differentially-private algorithm to guarantee a certain level of accuracy for a large class of 
datasets. 

Theorem 4 (Sample complexity lower bound). There are absolute constants c and d such 
that if 

d , 1 

n<c-— \-c--r-, c, ( 12 ) 

a(l — p) a(l — p) 

then for any a- differentially private algorithm A which computes an approximation v± to 
the top PCA direction, there exists a data set for which Ai — A2 > \, and yet 

E A [\(v llVl )\]<p. 



Theorem |3| shows that if n scales like a (i_ p ) then PPCA produces an approximation 



shows that n must scale like 



v\ that has correlation p with v\, whereas Theorem 1 ohuwd unau ,0 muoi o^aic ±i*vc a n_ p \ 
for any o-differentially private algorithm. In terms 01 scaling with d and a the upper and 
lower bounds match, and they also match up to logarithmic factors with respect to the 
correlation. By contrast, the following lower bound on the number of samples required by 
MOD-SULQ to ensure a certain level of accuracy shows that MOD-SULQ has a less favorable 
scaling with dimension. 

Theorem 5 (Sample complexity lower bound for MOD-SULQ). There are constants c and 
d such that if 



d 3 /yio g (d/5) 

; 1 



n < c v 6W ' (I - c'(l - p)), 

a 

then there exists a dataset of size n in dimension d, such that the top PCA direction v anc 
the output v of MOD-SULQ satisfy 

E[|(t>i,«i>|] <p. 



Notice that the dependence on n grows as d 3/2 in SULQ as opposed to d in PPCA. 
Dimensionality reduction via PCA is often used in applications where the data points occupy 
a low dimensional space but are presented in high dimensions. These bounds suggest that 
PPCA is better suited to such applications than MOD-SULQ. We next turn to validating 
this intuition on real data. In the next two section we provide proofs of these results. 
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4 Analysis of PPCA 



In this section we provide theoretical guarantees on the performance of PPCA. The proof of 
Theorem [2] follows from the results on the exponential mechanism |McSherry and Talwar 
2007|. To find the sample complexity of PPCA we bound the density of the Bingham 



distribution, leading to a sample complexity for k = 1 that depends on the gap Ai — A2 
between the top two eigenvalues. We close with a general lower bound on the sample 
complexity that holds for any a-differentially private algorithm. The lower bound matches 
our upper bound up to log factors, showing that PPCA is nearly optimal in terms of the 
scaling with dimension, privacy a, and utility (/a(')- 

4.1 Privacy guarantee 

Proof of Theorem [M Let X be a data matrix whose i-th column is Xi and A = ±XX T . The 



PP-PCA algorithm is the exponential mechanism of McSherry and Talwar 2007| applied to 
the score function 

q${X, v) = n ■ v T Av 

Consider X' = [x\ x<i • • • x n -\ x' n ] differ from X in a single column and let A' = ^X'X ;T . 
We have 

max \qp(X', v) — qp(X, v) I < \v T (x'x'I' — x n x^)v\ 



< 



I T I II 2 _ II T ||2 



< 1. 



The result then follows immediately from the results of McSherry and Talwar |2007 , Theo- 
rem 6]. □ 



4.2 Upper bound on utility 



The results of McSherry and Talwar |2007| bound the gap between the value of the function 
Qf{vi) = n ■ v{ Avi evaluated at the output v\ of the mechanism and the optimal value 
q{v\) = n ■ Ai. We derive a bound on the correlation ^(^1) = |(^i> u i)| y i a geometric 
arguments. To do this we need the a lower bound on the area of a circular cap. 



Lemma 6 (Lemmas 2.2 and 2.3 of |Ball 1997| ). Let fj, be the uniform measure on the unit 
sphere §> d_1 . For any x G S d_1 and < c < 1 



exp 



d- 1 



log J—^j ~ V {{ v € Sd 1 : ( v > x ^ ~ c }) ~ exp (~ dc2 l 2 ) ■ ( 13 ) 



Proof of Theorem^ Fix a privacy level a, target correlation p, and probability r/. Let X 
be the data matrix and B = (a/2)XX T and 



{u : \{u,vi)\ > p} 
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be the union of the two spherical caps centered at ±t>i. Let IA P denote the complement of 

An output vector v\ is "good" if it is in tl p . We first give some bounds on the score 
function qp{u) on the boundary between U p and U p , where (u,v±) = ±p. The function 
qF(u) is maximized when u is a linear combination of v\ and v 2 , the top two eigenvectors 
of A. It minimized when u is a linear combination of v\ and v^. Therefore 



na 



qF(u)<~(p 2 \i + (l-p 2 )\ 2 ) ueU p 



na 



qp(u) > ^{p 2 \i + {l- p 2 )\ d ) u£U p 



(14) 
(15) 



Let p(-) denote the uniform measure on the unit sphere. Then fixing an < b < 1, 
and t 



using (14), (15), and the fact that > 0, 

w(U p 



P(Z4) 

i 

1 F 1 (|*,|m,B) 



exp {u T Bu) dp 



1 F 1 (ik,i m ,B) Su. eX P( uTBu ) d t i 
exp (n(q/2) (p 2 Ai + (1 - p 2 )\ 2 )) ■ p (U p ) 
exp (n(a/2) (a 2 Ai + (1 - a 2 )A d )) • /i (W CT ) 

<exp(-^(a a Ai-(^Ai + (l- A > 2 )A a ))) 



< 



(16) 



Applying the lower bound from Lemma [6] to the denominator of (16) and the upper bound 
P (Up) < 1 yields 



p^) ~2~ (fT Al " (p Al + (1 " p )A2)) - — log r^- 

We must choose a a 2 > p 2 to make the exponent positive, but more precisely, 



(17) 



a 2 > p 2 + (1 - p 



l-a 2 <(l-p 2 



2\ ^2 



For simplicity, choose 



l-a z 



Ai 
Aa 
" Ai 



A2 
Ai 



So that 



^Ai - (p 2 Ai + (1 - p 2 )\ 2 ) 



(l-p 2 )Ai-(l-a 2 )Ai-(l-p 2 )A 2 
(1-p 2 ) fAi-i(Ai-A 2 )-A 



; (l-p 2 )(Ai-A 2 ), 



11 



and 



log < log 2 

1 — a 1 — a z 



4Ai 

° g (l-p2)( Al -A 2 )- 



Setting (17) greater than log(l/?y) yields 



na /1 2 \^\ \ w i 1 cZ — 1 1 4Ai 



-(1 - p 2 ){X l - A 2 ) > log - + — — log 



4 v 1 > K L " % 2 b (i-p2 )(Al _ A2 y 
Since 1 — p < 1 — /? 2 , if we choose 

d Aog(l/r?) . 4Ai 

n > — — — &v / " + log 



a(l-p)(Ai-A 2 ) V d to (l-p 2 )(Ai-A 2 ) 
then the output of PPCA will produce a v\ such that 

P(Kfli,tfi>| <p)<r,. 



□ 



The bound proved in the preceding theorem may be a little loose. It is difficult to 
measure the area on the unit sphere of the set {x : x T Ax > 1 — 7}. In the case where A = I 
this is just a spherical cap, but for general A it can have a more irregular shape. This is 
an artifact of the proof technique. A direct bound seems difficult because explicit bounds 
on the confluent hypergeometric function with matrix argument are not readily applicable; 



numerical approximation is still an open research question [Butler and Wood 2002 Koev 



and Edelman, 2006 



4.3 Lower bound on utility 

We now turn to a general lower bound on the sample complexity for any differentially private 
approximation to PCA. We construct K databases which differ in a small number of points 
whose top eigenvectors are not too far from each other; yet the gap Ai — A 2 between the top 
two eigenvalues is at least |. For such a collection, Lemma[7] shows that for any differentially 
private mechanism, the average correlation over the collection cannot be too large. That is, 
even if the eigengap between the first two eigenvalues is large, any a-differentially private 
mechanism cannot have high utility on all K data sets. The remainder of the argument is 
to construct these K data sets. 

The proof uses some simple eigenvalue and eigenvector computations. A matrix of 
positive entries 

,;:) < is > 

has characteristic polynomial 

det(A - XI) = A 2 - (a + c)A + (ac - b 2 ) 
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and eigenvalues 

A = l -(a + c) ± l -^/(a + cy-±(ac-V) 
= l -(a + c)±\^{a-cy + AV. (19) 

The eigenvectors are in the directions (6, —(a — A)) T . 

Lemma 7. Let T>i,T>2, ■ ■ ■ , fre X databases which differ in the value of at most ln( ' j ^~ 1 ^ 
points, and let u±, . . . , uk be the top eigenvectors ofVi,T>2, . ■ ■ , T^k- If A is any a -differentially 
private algorithm, then, 

K ( 1 

Y,^A [\(A(Vi),Ui)\] < K I 1 - — (1 - max|(u^)|) 

i=i ^ 



Proof. Let 

t = min(||uj - Uj\\ , \\ui + Uj\\), 
and Qi be the cap around ±Uj of radius i/2: 

Qi = {u : \\u — Ui\\ < t/2} U {u : \\u + Uj|| < i/2} . 

We claim that 

K 1 

^P^(^)^G,)> 2^-1)- (20) 

i=l 

The proof is by contradiction. Suppose the claim is false. Because all of the caps Qi are 
disjoint, and applying the definition of differential privacy, 

1 K 

-{K-l)>YVA(AV> i )tG i ) 
i=l 
K 

i=l i'^j 

> EE e_a ' In(i " _1)/ap ^(^') G 

i=l i'^j 

1 K 

i=i 

> K _ i(tf-i), 
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which is a contradiction, so (20) holds. Therefore by the Markov inequality 

K K 

J^^A [mm(\\A(Vi) - Ui f , \\MPi) + uif)] > ^P(X(Pi) i Gi) ■ 
i=i i=i 

>\(K-l)t\ 

Rewriting the norms in terms of inner products shows 

K 1 
2K-2Y,^A[\(A(V i ),u i )\] > -(K-l)(2-2max\(u i ,u j )\), 



i=i 



so 



K 



J2^A [\(A{Vi),Ui)\] <K\l- 1*^(1 »,ax |<„,, ,,)!) 
i=l 



- K { 1 ~ ^(1 " max | (n,,^) I)) 



□ 



Proof of Theorem^ Let y = e^, the (f-th coordinate vector, and let <j> G ((27rd) 1 ^ 2 ,1) 
Lemma 12 shows that there exists a packing W = {wi, W2, • • • , wk} of the sphere §' 



d-2 



spanned by {ei, e2, • • • , e^-i} such that max^j iOy)| < 0, where 



if = -(1 



-(d-2) 1 2 



Let /? = 1 - > , For i = 1, 2, . . . , K let the data set V; contain 

• n(l — /3) copies of y 

• n/3 copies of Z{ = yffiy + \/l — fiwi. 
The second moment matrix As of 2?,- is 



Ai = (1 - /3 + /3 2 )yy T + - /3 2 K T y + y^ T ) + /3(1 - P) Wi wf . 
By choosing an basis containing y and itfj, we can write this as 



A 



1-/3 + /3 2 P^fJ^W 
/3^^2 /3(l-/3) 




This is in the form @, with a = l- /3 + /3 2 , 5 = /3i//3 - f3 2 , and c = /3(1 - (3). The top 
eigenvector is therefore 



(a- A) 
y + - 7 =^==L==w i . 



^V + {a-\y ^b 2 + (a - A) 
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where A is given by ( |19| ). We observe that the gap between the top two eigenvalues is: 
y/(a - c) 2 + 46 2 = V(l - 2/3(1 - /3)) 2 + 4/3 2 (/3 - /3 2 ) > l - 



Therefore 

b 2 (a - A) 2 
max , > < — ; - — 7T + T7i tttt ma: 



6 2 + (a - A) 2 
To continue the upper bound we compute A: 

A = -(a + c) + ^V( fl -c) 2 + 4&2 



<l- lA^ (l-^ (21) 



J (l + V(l - 2/3 + 2/3 2 ) 2 + 4(/3 3 - ^ 



2 



- (l + v 7 ! + 4/3 2 + 4/3 4 - 4/3 + 4/3 2 - 8/3 3 + 4/3 3 - 4/3 4 



2 



1 + ^1 - 4/3 + 8/3 2 - 4/3= 
i (l + ^(1 - 4/3(1 - /3) 2 ) . 



2 

A Taylor series expansion of \/l — x implies that for x < 1, 

1 - x < Vl-x < 1 - -. 

2 

Applying this equation on A, we get: 

1- 2/3(1- /3) 2 < A< 1-/3(1- ^) 2 . (22) 

Note that ^ ^ 

A > -(a + c) + -y^a-c) 2 = a, 

so (a — A) 2 is maximized when A is as large as possible. Therefore 

(a - A) 2 < (1 - /3 + /3 2 - 1 + /3(1 - /3) 2 ) 2 
= (-/3(l-/3) + /3(l-/3) 2 ) 2 
= /3 4 (l-/3) 2 . 

Moreover, 

6 2 = /3 2 (/3 -/3 2 ) = /3 3 (l-/3). 
Since for v > 0, xj{y + x) is monotonically increasing in x, we have 

(a -A) 2 < /3 4 (l-/3) 2 



& 2 + (a - A) 2 " /3 4 (1 - /3) 2 + /3 3 (1 - p) 
/3(l-/3) 



< 



l + /3(l-/3) 
</3- 
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Combining this with Equation 21 we get: 



max \{m, Uj)\ < 1 — (3(1 — (ft). 
Applying Lemma [7] shows that for at least one database i 

Ea lUiV^Ui)]] < 1 - 1(1 - (1 - 0(1 - 0))) 

= 1-1(1-0). 

16 v Vl 

Setting the left side equal to some target expected correlation p, 

(i-rt>^(i-#) 



16na 



Thus 



1 fl-4> ( 1 , 1 

Setting (ft as some constant close to 1 shows that there exist constants c' and c" such that 

d „ 1 



n > c — r — c 

a(l — p) q(1 — p) 



□ 



5 Analysis of MOD-SULQ 

In this section we provide theoretical guarantees on the performance of the MOD-SULQ 
algorithm. Theorem [T] shows that MOD-SULQ is (a, 5)-differentially private. Theorem 14 



provides a lower bound on the distance between the vector released by MOD-SULQ and 
the true top eigenvector in terms of the privacy parameters a and 5 and the number of 
points n in the data set. This implicitly gives a lower bound on the sample complexity of 
MOD-SULQ. We provide some graphical illustration of this tradeoff. 

The following upper bound will be useful for future calculations : for two unit vectors 
x and y, 

{xiXj - yiyj) 2 < 2. (23) 

Note that this upper bound is achievable by setting x and y to be orthogonal elementary 
vectors. 



16 



5.1 Privacy guarantee 

We first justify the choice of (3 2 in the MOD-SULQ algorithm. 

Proof of Theorem^ Let B and B be two independent symmetric random matrices where 
{Bij : 1 < i < j ' < d} and {B^ : 1 < i < j < d} are each sets of i.i.d. Gaussian 
random variables with mean and variance j3 2 . Consider two data sets T> = {x{ : i £ [n]} 
and T> = T>\ U {x n } \ {x n } and let A and A denote their second moment matrices. Let 
G = A + B and G = A + B. We first calculate the log ratio of the densities of G and G at 
a point H: 



'G y l<i<j<d 

l<i<j<d V 



From (23) the last term is upper bounded by 2/n . To upper bound the first term, 

E\&n,i%n,j %n,i%n,j\ — 2 max / CL^Oj 
J J o:l|o||<l ' 

i<«<i<rf i<*<i<^ 

<2.^ + (i ).i 
= d+1. 

Note that this bound is not too loose - by taking x = d _1 / 2 l and x = (1,0,..., 0) T , this 
term is still linear in d. 

Then for any measurable set S of matrices, 

P(G€S)<expQj2 (^ + 1)7 + ^)) P ( 6g5 ) + F (Ai >7for alU,i). (24) 

To handle the last term, use a union bound over the (d 2 + d)/2 variables {B^} together 
with the tail bound, which holds for 7 > f3: 

P(Bii >l)<^=e^ 2 / 2 ^. 
V2tt 



Thus setting P (B^ > 7 for some = 5 yields the condition 



§ = d 2 + d c _^ /W 2 



Rearranging to solve for 7 gives 
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for d > 1 and 5 < 3/\/2vre. This then gives an expression for a to make (24) imply (a, 5) 
differential privacy: 



1 



a 



o«2 I -(<*+l)7+ 2 



2 



Solving for /? using the quadratic formula yields a particularly messy expression: 

1/2 



/3 = ^-l/ 21 °g( ) +— 2(d+l) 2 log' 



< 



2na 

d + 1 

na 



'21og(^±£|-|- 



52V27T 



2na 
1 

an 



52J2-K 



4a 



(25) 

□ 



5.2 Lower bound on utility 



The main tool in our lower bound is a generalization by Yu [1997 of an information-theoretic 
inequality due to Fano. 



Theorem 8 (Fano's inequality Yu, 1997] ). Let 1Z be a set and Q be a parameter space 
with a pseudo-metric d(-). Let J 7 be a set of r densities {fi, ■ ■ ■ , f r } on 1Z corresponding 
to parameter values {0\, . . . ,9 r } in Q. Let X have distribution f G T with corresponding 
parameter 8 and let 6{X) be an estimate of 6. If, for all i and j 



d(9 



> T 



and 



then 



KL(/ i ||/ i )< 7 , 



max E,- [d(0,0,-)] > - 
j 2 



7 + log 2 
log r 



(26) 



(27) 



(28) 



where Ej[-] denotes the expectation with respect to distribution fj. 

To use this inequality, we will construct a set of densities on the set of covariance 
matrices corresponding distribution of the random matrix in the MOD-SULQ algorithm 
under different inputs. These inputs will be chosen using a set of unit vectors which are a 
packing on the surface of the unit sphere. 
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5.3 A packing lemma 

The proof of this lemma is relatively straightforward. The following is a slight refinement 
of a lemma due to Csiszar and Narayan |1988] (See also Csiszar and Narayan 1991] ) . 

Lemma 9. Let Zi, Z2, . . . , Zjy be arbitrary random variables and let /i(Zi, . . . , Zj) 6e ar- 
bitrary with < /, < 1, i = 1, 2, . . . , N . Then the condition 



E[/ i (Z 1 ,...,Z i )|Zi,...,Z i _i] <a; a.s., 



1,2 



implies that 



\^Y J Ji{Z l ,...,Z i )>t S j <exp ^-iVt(log2) + ^ ( 



Proof. First apply Markov's inequality: 



\ i=l 



2^i=l /i( Z lv,Zi) ^ 2 



Art 



< 2 -JV *E 



2^i=l /i( Z lv: Z i) 

2E£ 1 / l (Zi,...,z I ) ]E 2 /iv(Zi,...,z JV )| Zi) ^ Zjv _ i 



Now note that for b £ [0, 1] we have 2 b < 1 + b, so 



E 



2 /Af(Zi,...,Zjv)| Z ^ 



->iV-l 



<E[1 + /jv(Zi, 

< (1 + ajv) 

< exp(aAr). 



Therefore 



N 



N 



^/ i (Zi,...,Z i )>t <exp(-Aft(log2) + aAr)E 2^=i /i^.-A) 



i=l 



Continuing in the same way yields 



\JjY^fi(Zl,...,Zi)>tj <exp ^-iVt(log2) + ^ ( 



(29) 



(30) 



□ 



The second technical lemma Csiszar and Narayan, 1991, Lemma 2] is a basic result 
about the distribution of inner product between a randomly chosen unit vector and any 
other fixed vector. It is a consequence of a result of Shannon [1959 on the distribution of 
the angle between a uniformly distributed unit vector and a fixed unit vector. 
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Lemma 10 (Lemma 2 of Csiszar and Narayan, 1991| ). Let U be uniformly distributed 
on the unit sphere S d_1 in MF. Then for every unit vector u on this sphere and any <f> € 
[(27rd) -1 / 2 , 1), the following inequality holds: 

P((U,u) ><j>) < (l-cj) 2 )^ 1 )/ 2 . (31) 

Lemma 11 (Packing set on the unit sphere). Let a dimension d and (ft £ [(27r<i) _1//2 , 1) be 
given. For N and t satisfying 

-ATt(log2) + N(N - 1)(1 - 2 )( rf - 1 )/ 2 < 1, (32) 

there exists a set of K = [(1 — t)N\ unit vectors C such that for all distinct pairs fj,, v £ C, 

\(fi,u)\<tt>. (33) 

Proof. The goal is to generate a set of ./V unit vectors on the surface of the sphere such 
that they have large pairwise distances, or correspondingly small pairwise inner products. 
To that end, define Zi, Z2, . . . , Zjy i.i.d. uniformly distributed on S d_1 and 

/ i (Z 1 ,...,Z l -) = l(|(Z i ,Z i }| >ct>, j<i). (34) 

That is, fi = 1 if Zj has large inner product with any Zj for j < i. The conditional 
expectation, by a union bound and Lemma [TO], is 

E [/i(Zi, . . . , Z,)|Z l7 . . . , Zi_i] < 2(i - 1)(1 - <A 2 ) (d - 1)/2 . (35) 

Let Oj = (i - 1)(1 - <p 2 )^- 1 ^ 2 . Then 

AT 

^a i = iV(iV-l)(l-(/) 2 )( d - 1 )/ 2 . (36) 

Then Lemma [9] shows 

2N(d-l)/2 x 



i=l 



^E/i( Z l.---» Z *)>^ <exp(-iVt(log2) + iV(iV-l)(l-a 2 



(37) 



This inequality implies that as long as 

-iVi(log2) + iV(iV - 1)(1 - ( /, 2 )( d - 1 )/ 2 < 1, (38) 

then there is a finite probability that {Zj} contains a subset {Z^} of size [(1 — t)N\ such 

that (Zj, Zj) < (p for all (i,j). Therefore such a set exists. □ 

A simple setting of the parameters gives the following packing. 
Lemma 12 (Simple packing set). For (p 6 [(27r<i) -1 / 2 , 1), there exists a set of 

K = ie X p((d-l)l„ g7r L=) (39) 

vectors C in S d_1 such that for any pair ix,v € C, the inner product between them satisfies 

\M\<<t>- (40) 
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Proof. Applying Lemma 11 yields a set of K vectors C satisfying (32) and (33). To get a 



simple bound that's easy to work with, we can set 

-NtQogi) + N(N - 1)(1 - 2 )( d - 1 )/ 2 



0, 



(41) 



and find an N close to this. Setting tp = (1 — (j) 2 )( d ^z 2 , the quadratic formula solving for 
N yields 



N = (tlog2 + V+ ((tlog2 + ip) 2 + 4^) 
t 

> 2^' 



1/2 



Now setting K = and t = 1/2 gives ( 39 ) . So there exists a set of -K" vectors on S d 1 

whose pairwise inner products are smaller than □ 

The maximum set of points that can be selected on a sphere of dimension d such that 
their pairwise inner products are bounded by <p is an open question. These sets are some- 
times referred to as spherical codes [Conway and Sloane 1998 because they correspond to 



a set of signaling points of dimension d that can be perfectly decoded over a channel with 
bounded noise. The bounds here are from a probabilistic construction and can be tightened 
for smaller d. However, in terms of scaling with d this construction is essentially optimal 
|Shannon 1959 . 



5.4 Lower bounds for input perturbation 

Lemma 13. Let S be a positive definite matrix and let f denote the density J\f(a, S) and 
g denote the density JV(6,E). Then KL (f\\g) = \{a - 6) T S~ 1 (a - b). 

Proof. This is a simple calculation: 



KL(/|| 5 ) 



1 



a T E" 1 ( 



(x - a) T TT l {x - a) + -(x - b)Z~ l (x - b) 
i - a T ^- 1 b - 6 T E" 1 a + 6 T IT x 6) 



\{a -bfTT 1 (a- 6). 



□ 

The next theorem is a lower bound on the expected distance between the vector output 
by MOD-SULQ and the true top eigenvector. In order to get this lower bound, we construct 
a class of data sets and use Fano's inequality to derive a bound on the minimax error over 
the class. 

Theorem 14 (Utility bound for MOD-SULQ). Let d, n, and a > be given and let f3 be 
given by ^ so that the output of MOD-SULQ is (a, S) -differentially private for all data sets 
in M. d with n elements. Then there exists a data set with n elements such that if v\ denotes 
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the output of MOD-SULQ and v\ is the top eigenvector of the empirical covariance matrix 
of the data set, the expected correlation (v\,vi) is upper bounded: 



E[|(f)i,ui) 



< min 

</>e* 



(1 



l//3 2 + log2 



(d-l)log 



log(8) 



(42) 



where 
$ G 



max 



exp 



21og(8d) 
d - 1 




exp 



2//3 2 + log(256) 
d- 1 




(43) 



shows there exists a set of K unit vectors C such 



Proof. For G [(2vrd)- 1 / 2 , 1), Lemma[l2 
that for the inner product between them satisfies \(fx, i>)\ < 4>, where K is given by 

(39). Note that for small <f> this setting of K is loose, but any orthonormal basis provides d 
unit vectors which are orthogonal, setting K = d and solving for (j) yields 

21og(8d)^ 1/2 



1 



exp 



d-1 

Setting the lower bound on cp to the maximum of these two yields the set of (j) and K which 
we will consider in ( 43 ) . 



For any unit vector ji, let 



A(fi) = iiijl t + N, 



(44) 



where iV is a d x d symmetric random matrix such that {Ny : 1 < i < j < d} are 
i.i.d. N(0,(3 2 ), where /3 2 is the noise variance used in the MOD-SULQ algorithm. Due to 
symmetry, the matrix A(/j,) can be thought of as a jointly Gaussian random vector on the 
d(d + l)/2 variables {Aij(fi) : 1 < i < j < d}. The mean of this vector is 



/i 



(45) 



and the covariance is 2 Id{d+\)/2- Let / M denote the density of this vector. 



For ix, v G C, the divergence between f^ and /„ can be calculated using Lemma 13 



KL{f tx \\f u )= l -{ix-u) T V-\ix-v) 

1 ,._ _„9 



< 



2/3 2 
1 

J 



2 ' 



(46) 



The last line follows from the fact that the vectors in C are unit norm. 

For any two vectors fx, v G C, lower bound the Euclidean distance between them using 
the upper bound on the inner product: 



> ^20" 



(47) 
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Let = S^ -1 with the Euclidean norm and 1Z be the set of distributions {A(p) : \i G ©}. 
From (47) and (46), the set C satisfies the conditions of Theorem [8] with T = {/ M : fx G C}, 
r = K , t = \/2(l — (ft), and 7 
MOD-SULQ, 



1 



The conclusion of the Theorem shows that for 



maxEf [ 1 1 -0 



Ml 



> 



V 2 (! " J) 



l//3 2 + log2 
log K 



(48) 



This lower bound is vacuous when the term inside the parenthesis is negative, which imposes 
further conditions on (p. Setting log If = 1//3 2 + log 2, we can solve to find another lower 
bound on 6: 



> 



exp 



2//3 2 + log(256) \ 
d-1 ' 



(49) 



This yields the third term in (43). Note that for larger n this term will dominate the others. 



Using Jensen's inequality on the the left side of (48): 

(1-4 



maxE /fi [2(1 -1(^)1)] > 



So there exists a fi G C such that 



1 



l//3 2 + log2 
logiC 



E /M [|(t),M)|]<l 



(1 



l//3 2 + log2 
logK 



(50) 



Consider the data set consisting of n copies of \i. The corresponding covariance matrix is 
• T with top eigenvector v\ = ji. The output of the algorithm MOD-SULQ applied to this 



data set is an estimator of /x and hence satisfies (50). Minimizing over 
bound. 



gives the desired 
□ 



The minimization over (f> in (42) does not lead to analytically pretty results, but numeri- 
cal optimization can give some insight into these bounds. In all experiments we set 5 = 0.01. 
Figure [T] shows the correlation as a function of n for different dimensions and different val- 
ues of a. In high dimension, the lower bound is shows that the expected performance of 
MOD-SULQ is poor when there are a small number of data points. This limitation may be 
particularly acute when the data lies in a very low dimensional subspace but is presented 
in very high dimension. In such "sparse" settings, perturbing the input as in MOD-SULQ is 
not a good approach. However, in lower dimensions and data-rich regimes, the performance 
may be more favorable. 

A little calculation yields the sample complexity bound. 



Proof of Theorem^ Suppose E [| (vi,v\) 



p. Then a little algebra shows 



2 a/1 - p > min Jl 

</>e* 



l//3 2 + log2 



(d-1) 



log(8) 
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Correlation versus alpha for different n 

128 256 




Data set size 

Figure 1: Upper bound on the correlation between \ {v\, v\)\ for MOD-SULQ. The horizontal 
axis is the size of the data set n, and the vertical axis is the correlation. The four panels 
correspond to values of d = 64, 128, 256, and 1024. 

Set such that (d - 1) log - 7 J= - log(8) = 2(l//3 2 + log 2) to get 

y/l-<f> 2 

Vi -p> ^Ji-<p. 

Since we are concerned with the scaling behavior for large d and n, 
so 

Lower bound /3 in ^ to get for some constant c\, 

d 2 

/3 2 > ci-z-ylog(d/5). 
Substituting this we get for some constant C2 that 

(l-C 2 (l-p))<C3 d 3 log(d/J) . 
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Now solving for n shows 



n> c 



(i 3 /yiog(d/£) 



(1 - c'(l - p)) 



□ 



6 Implementation and Experiments 

We implemented MOD-SULQ and PPCA in order to test our theoretical results. Implement- 
ing PPCA involved using a Gibbs sampling procedure |Hoff 2009 . A crucial parameter in 
MCMC procedures is the burn-in time, which is how long the chain must be run for it to 
reach its stationary distribution. Theoretically, chains reach their stationary distribution 
only in the limit; however, in practice MCMC users must sample after some finite time. 
In order to use this procedure appropriately, we determined a burn-in time using our data 
sets. The interaction of MCMC procedures and differential privacy is a rich area for future 
research. 



6.1 Data and preprocessing 

We report on the performance of our algorithm on some real datasets. 
datasets from four different domains - kddcup99 
features of 494,021 network connections, census 



Hettich and Bay 



1999 



Asuncion and Newman, 2007 



We chose four 
which includes 
a demo- 



graphic data set on 199,523 individuals, localization [Kaluza et al. 2010 , a medical 
dataset with 164,860 instances of sensor readings on individuals engaged in different activ- 
ities, and insurance |van der Putten and van Someren 2000] , a dataset on product usage 
and demographics of 9,822 individuals. 

These datasets contain a mix of continuous and categorical features. We preprocessed 
each dataset by converting a feature with q discrete values to a vector in {0, l} 9 ; after pre- 
processing, the datasets kddcup99, census, localization and insurance have dimensions 
116, 513, 44 and 150 respectively. We also normalized each row so that each entry has 
maximum value 1, and normalize each column such that the maximum (Euclidean) column 
norm is 1. We choose k = 4 for kddcup, k = 8 for census, k = 10 for localization and 
k = 11 for insurance; in each case, the utility t/F^fc) of the top-k PCA subspace of the 
data matrix accounts for at least 80% of ||-A|| F . Thus, all four datasets, although fairly high 
dimensional, have good low-dimensional representations. The properties of each dataset are 
summarized in Table 16.11 



6.2 Implementation of Gibbs sampling 

The theoretical analysis of PPCA uses properties of the Bingham distribution BMF / t(-) given 



in (10). To implement this algorithm for experiments we use a Gibbs sampler due to iHoff 



2009 . The Gibbs sampling scheme induces a Markov Chain, the stationary distribution of 



which is the density in (10). Gibbs sampling and other MCMC procedures are widely used 



in statistics, scientific modeling, and machine learning to estimate properties of complex 



distributions Brooks 1998 
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Dataset 


^instances 


^dimensions 


k 


q F (U)/\\A\\ F 


kddcup 


494,021 


116 


4 


0.96 


census 


199,523 


513 


8 


0.81 


localization 


164,860 


44 


10 


0.81 


insurance 


9,822 


150 


11 


0.81 



Table 1: Parameters of each dataset. The second column is the number of dimensions 
after preprocessing, k is the dimensionality of the PCA, and the fourth column contains 
Qf{U)/ \\A\\ f where U is the top k PCA subspace. 



Finding the speed of convergence of MCMC methods is still an open area of research. 



There has been much theoretical work on estimating 


convegence times 


Jones and Hobart 


2004, Douc et al. 2004, Jones and Hobart, 2001 Roberts 


1999 


Roberts and Sahu, 2001 


Roberts 


1999 




Roberts and Sahu 


2001 


Rosenthal 




1995 




Kolasa 




1999 




2000 , but unfor- 



tunately, most theoretical guarantees are available only in special cases and are often too 
weak for practical use. In lieu of theoretical guarantees, users of MCMC methods em- 
pirically estimate the burn-in time, or the number of iterations after which the chain is 
sufficiently close to its stationary distribution. Statisticians employ a range of diagnostic 
methods and statistical tests to empirically determine if the Markov chain is close to sta- 
tionarity Cowles and Carlin, 1996, Brooks and Roberts, 1998, Brooks and Gelman 1998 



El Adlouni et al. 



2006 



These tests do not provide a sufficient guarantee of stationarity, 
and there is no "best test" to use. In practice, the convergence of derived statistics is used 



to estimate an appropriate the burn-in time. In the case of the Bingham distribution, Hoff 



2009 performs qualitative measures of convergence. Developing a better characterization 
of the convergence of this Gibbs sampler is also an important question for future work. 

To choose an appropriate burn-in time, we examined the time series trace of the Markov 
Chain. We ran I copies of the chain, starting from I different initial locations drawn uni- 
formly from the set of all d x k matrices with orthonormal columns. Let X l (t) be the output 
of the i-th copy at iteration t, and let U be the top k PCA subspace of the data. For each 
i, we plot the magnitude of the projection of X l (t) onto U. After a number of iterations, 
the projections should converge to the same value. 

For each copy, we also plot the following statistic as a function of iteration T: 



4(T) 



1 



'k 



t=i 



where || • \\p is the Frobenius norm. The matrix Bingham distribution has mean 0, and 
hence with increasing T, the statistic F^{T) should converge to 0. 

In our simulations, we observed that the Gibbs sampler converges to Fk(t) < 0.01 at 
t = 20,000 when run on data with a few hundred dimensions and with k between 5 and 10; 
we thus chose to run the Gibbs sampler for T = 20,000 timesteps for all the datasets. We 
show log F^(T) as a function of iteration T for datasets insurance and kddcup in Figures [3] 
and[2]respectively; the plots are over 5 trajectories of the Markov Chain, which are initialized 
at 5 locations drawn uniformly from the set of all d x k matrices with orthonormal columns. 
The plots show that F%,(T) decreases rapidly after a few thousand iterations, and is less 
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Figure 2: Plot of log F k (T) for k = 4 as a 
function of iteration T for 40, 000 iterations 
of the Gibbs sampler for the kddcup dataset. 



Figure 3: Plot of logi^T) for k = 11 as a 
function of iteration T for 40, 000 iterations of 
the Gibbs sampler for the insurance dataset. 



than 0.01 after T = 20,000 in both cases, log F^{T) also appears to have a larger variance 
for kddcup than for insurance; this is explained by the fact that the kddcup dataset has 
a much larger number of samples, which makes its stationary distribution farther from the 
initial distribution of the sampler. 

Our simulations indicate that the chains converge fairly rapidly, particularly when 
1 1 .A — -AfcHp is small so that A k is a good approximation to A. Convergence is slower for 
larger n when the initial state is chosen from the uniform distribution over all k x d matrices 
with orthonormal columns; this is explained by the fact that for larger n, the stationary 
distribution is farther in variation distance from the starting distribution, which results in 
a longer convergence time. 



6.3 Scaling with data set size 

We ran three algorithms on these data sets : standard (non-private) PCA, MOD-SULQ, and 
PPCA. As a sanity check, we also tried a uniformly generated random projection - since 
this projection is data-independent we would expect it to have low utility. We measured 
the utility qp(U), where U is the ^-dimensional subspace output by the algorithm; q$U is 
maximized when U is the top-A: PCA subspace, and thus this reflects how close the output 
subspace is to the true PCA subspace in terms of representing the data. Although our 
theoretical results hold for (?a(")> the "energy" (?f( - ) is more relevant in practice for larger 
k. 

The results for utility qp(U) are shown in Figures 4(a) [ |4(b") 4(c), and |4(d)"| each plot 



shows qF(U) as a function of sample size for the fc-dimensional subspace output by PPCA, 
MOD-SULQ, non-private PCA, and random projections. PPCA was run with privacy pa- 
rameter a = 0.1; MOD-SULQ with a = 0.1 and 5 = 0.01. Each value in the figure is an 
average over 5 random permutations of the data, as well as 10 random starting points of 
the Gibbs sampler per permutation (for PPCA), and 100 random runs per permutation (for 
MOD-SULQ and random projections). 

The plots show that PPCA always outperforms MOD-SULQ, and approaches the per- 
formance of non-private PCA with increasing sample size. By contrast, for most of the 
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problems and sample sizes considered by our experiments, MOD-SULQ does not perform 
much better than random projections. The only exception is localization, which has 
much lower dimension (44). This confirms that MOD-SULQ does not scale very well with 
the data dimension d. The performance of both MOD-SULQ and PPCA improve as the sam- 
ple size increases; the improvement is faster for PPCA than for MOD-SULQ. However, to be 
fair, MOD-SULQ is simpler and hence runs faster than PPCA. At the sample sizes in our 
experiments, the performance of non-private PCA does not improve much with a further 
increase in samples. Our theoretical results suggest that the performance of differentially 
private PCA cannot be significantly improved over these experiments. 
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Figure 4: Utility qF(U) for different data sets. 



6.4 Effect of privacy on classification 



A common use of a dimension reduction algorithm is as a precursor to classification or 
clustering; to evaluate the effectiveness of the different algorithms, we projected the data 
onto the subspace output by the algorithms, and measured the classification accuracy using 
the projected data. The classification results are summarized in Table 6.4 We chose the 
normal vs. all classification task in kddcup99, and the falling vs. all classification task in 
localization, r] We used a linear SVM for all classification tasks, which is implemented 



by libSVM Chang and Lin, 2011 



For the classification experiments, we used half of the data as a holdout set for computing 
a projection subspace. We projected the classification data onto the subspace computed 



For the other two datasets, census and insurance, the classification accuracy of linear SVM after 
(non-private) PCAs is as low as always predicting the majority label. 
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KDDCUP 


LOCALIZATION 


Non-private PCA 


98.97 ±0.05 


100 ±0 


PPCA 


98.95 ±0.05 


100 ±0 


SULQ 


98.18 ±0.65 


97.06 ±2.17 


Random Projections 


98.23 ±0.49 


96.28 ±2.34 



Table 2: Classification accuracy in the fc-dimensional subspaces for kddcup99(fc = 4), and 
localization^ = 10) in the /c-dimensional subspaces reported by the different algorithms. 

based on the holdout set; 10% of this data was used for training and parameter-tuning, 
and the rest for testing. We repeated the classification process 5 times for 5 different 
(random) projections for each algorithm, and then ran the entire procedure over 5 random 
permutations of the data. Each value in the figure is thus an average over 5 x 5 = 25 rounds 
of classification. 

The classification results show that our algorithm performs almost as well as non-private 
PCA for classification in the top k PCA subspace, while the performance of MOD-SULQ and 
random projections are a little worse. The classification accuracy while using MOD-SULQ 
and random projections also appears to have higher variance compared to our algorithm 
and non-private PCA; this can be explained by the fact that these projections tend to be 
farther from the PCA subspace, in which the data has higher classification accuracy. 

6.5 Effect of the privacy requirement 

To check the effect of the privacy requirement, we generated a synthetic data set of n = 5,000 
points drawn from a Gaussian distribution in d = 10 with mean and whose covariance 
matrix had eigenvalues 

{0.5, 0.30, 0.04, 0.03, 0.02, 0.01, 0.004, 0.003, 0.001, 0.001}. 

In this case the space spanned by the top two eigenvectors has most of the energy, so we 
chose k = 2 and plotted the utility <?f(-) for non-private PCA, MOD-SULQ with 5 = 0.05, 
and PPCA with a burn-in time of T = 1000. We drew 100 samples from each privacy- 
preserving algorithm and the plot of the average utility versus a is shown in Figure [5} The 
privacy requirement relaxes as a increases, and both MOD-SULQ and PPCA approach the 
utility of PCA without privacy constraints. However, for moderate a PPCA still captures 
most of the utility, whereas the gap between MOD-SULQ and PPCA becomes quite large. 

7 Discussion 

In this paper we investigated the theoretical and empirical performance of differentially 
private approximations to PCA. Empirically, we showed that MOD-SULQ and PPCA differ 
markedly in how well they approximate the top-k subspace of the data. The reason for 
this, theoretically, is that the sample complexity of MOD-SULQ is 0(cf 3 / 2 -y/log d) whereas 
the sample complexity of PPCA is 0{d). Because PPCA uses the exponential mechanism 
with </f( - ) as the utility function, it is not surprising that it performs well. However, MOD- 
SULQ often had a performance comparable to random projections, indicating that the real 
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Figure 5: Plot of Qf(U) versus a for a synthetic data set with n = 5,000, d = 10, and k = 2. 



data sets we used were too small for it to be effective. We furthermore showed that PPCA 
is nearly optimal, in that any differentially private approximation to PCA must use £l(d) 
samples. An open question is to find algorithms that guarantee weaker (a, <5)-differential 
privacy in exchange for better dependence on the dimension d, analogous to |Hall et "ah 



2012 



Our investigation brought up many interesting issues to consider for future work. Be- 
cause differential privacy is a mathematical definition, the description of differentially pri- 
vate procedure makes a number of idealizations regarding computation. Some of these 
idealizations are related to running the algorithm in a real environment; the designers of 
differentially private systems such as Airavat Roy et al. |2010| require additional security 
assumptions that have to be verified. At a more basic level, the difference between truly 
random noise and pseudorandomness Mironov et al. 2009 , McGregor et al. [2010 and the 
effects of finite precision can lead to a gap between the theoretical ideal and practice. Fi- 
nally, implementation of private algorithms can lead to further gaps between theory and 
practice. For example, implementing objective perturbation Chaudhuri et al. [201 1 uses 
numerical optimization tools for approximate solutions to convex optimization problems, 
which have complex termination conditions that are not part of the accompanying theo- 
retical analysis. Our implementation of PPCA does not sample exactly from the Bingham 
distribution, and we leave a theoretical investigation of the impact of approximate sampling 
for future work. 

Finally, more germane to the work on PCA here is to prove sample complexity results 
for general k rather than the case k = 1 here. For k = 1 the utility functions qF(-) and (/a(') 
are related, but for general k it is not immediately clear what metric best captures the idea 
of "approximating" PCA. In particular, there is a difference between producing an approx- 
imation to the top-/c PCA subspace (the problem considered here) and an approximation 
to the top k eigenvectors. Developing a framework for such approximations is of interest 
more generally in machine learning. 
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